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[57] ABSTRACT 

A method and apparatus for supervised neural learning of 
time dependent trajectories exploits the concepts of adjoint 
operators to enable computation of the gradient of an 
objective functional with respect to the various parameters 
of the network architecture in a highly efficient manner. 
Specifically, it combines the advantage of dramatic reduc- 
tions in computational complexity inherent in adjoint meth- 
ods with the ability to solve two adjoint systems of equations 
together forward in time. Not only is a large amount of 
computation and storage saved, but the handling of real-time 
applications becomes also possible. The invention has been 
applied it to two examples of representative complexity 
which have recently been analyzed in the open literature and 
demonstrated that a circular trajectory can be learned in 
approximately 200 iterations compared to the 12000 
reported in the literature. A figure eight trajectory was 
achieved in under 500 iterations compared to 20000 previ- 
ously required. The trajectories computed using our new 
method are much closer to the target trajectories than was 
reported in previous studies. 

20 Claims, 4 Drawing Sheets 
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NEURAL NETWORK TRAINING BY 
INTEGRATION OF ADJOINT SYSTEMS OF 
EQUATIONS FORWARD IN TIME 

BACKGROUND OF THE INVENTION 
Origin of the Invention 

The invention described herein was made in the perfor- 
mance of work under a NASA contract, and is subject to the 
provisions of Public Law 96517 (35 USC 202) in which the 
contractor has elected not to retain title. 

Microfiche Appendix 

A computer program (microfiche, 26 pages) embodying 
the invention is listed in the microfiche appendix filed with 
this specification. The microfiche appendix contains mate- 
rial which is subject to copyright protection. The copyright 
owner has no objection to the facsimile reproduction by 
anyone of the patent document or the patent disclosure, as it 
appears in the Patent and Trademark Office patent file or 
records, but otherwise reserves all copyrights whatsoever. 
Technical Field 

The invention relates to methods for training neural 
networks and in particular to neural network training meth- 
ods using adjoint systems of equations corresponding to the 
forward sensitivity equations of the neural network. 
Background Art 

The following publications represent the state of the art in 
neural network training techniques, and are referred to in the 
specification below by author name and year: 

Barhen, J., Toomarian, N. and Gulati, S. (1900a) “Adjoint 
operator algorithms for faster learning and dynamical neural 
networks”. In David S. Touretzky (Ed.), Advances in Neural 
Information Processing Systems. Vol. 2, 498-508, San 
Mateo, Calif.: Morgan Kaufmann. 

Barhen, J., Toomarian, N. and Gulati, S. (1990b). “Appli- 
cation of adjoint operators to neural learning”. Applied 
Mathematical Letters , 3 (3), 13-18. 

Cacuci, D. G. (1981). “Sensitivity theory for nonlinear 
systems”. Journal Math. Phys., 22 (12), 2794—2802. 

Grossberg, S. (1987). The Adaptive brain. Vol. 2, North- 
Holland. 

Hirsch, M. W. (1989) “Convergent activation dynamics in 
continuous time networks”. Neural Networks, 2 (5), 
331-349. 

Maudlin, P. J, Parks, C. V. and Weber C. F. (1980). 
“Thermal-hydraulic differential sensitivity theory”. Ameri- 
can Society of Mechanical Engineering paper WA/HT-56. 

Narendra, K. S. and Parthasarathy, K. (1990). “Identifi- 
cation and control of dynamical systems using neural net- 
works”. IEEE transaction on Neural Networks, 1 (1), 4-27. 

Oblow, E. M. (1978). “Sensitivity theory for reactor 
thermal-hydraulic problems”. Nuclear Science and 
Engineering, 68, 322-357. 

Parlos, A. G., et. al. (1991). “Dynamic learning in recur- 
rent neural networks for nonlinear system identification”, 
preprint 

Pearlmutter, B. A. (1989). “Learning state space trajec- 
tories in recurrent neural networks”. Neural Computation, 1 
(2), 263-269. 

Pearlmutter, B. A. (1990). “Dynamic recurrent neural 
networks”. Technical Report CMU-CS-90-196, School of 
Computer Science, Carnegie Mellon University, Pittsburgh, 
Pa. Pineada, F. (1990). “Time dependent adaptive neural 
networks”.In David S. Touretzky (Ed.), Advances in Neural 
Information Processing Systems. Vol. 2, 710-718, San 
Mateo, Calif.: Morgan Kaufmann. 

Rumelhart, D. E., Hinton, G. E. and Williams, R. J. 
(1986). “Learning internal representations by error propa- 


2 

gation”. In D. E. Rumelhart, J. L. McCleland and the PDP 
Research Group, Parallel Distributed Processing: Explora- 
tion in the Microstructure of Cognition, Vol. 1, Foundations, 
Cambridge: MIT Press/Bradford Books. 

5 Sato, M. (1990). “A real time learning algorithm for 
recurrent analog neural networks”. Biological Cybernetics, 
62 (3), 237-242. 

Toomarian, N., Wacholder, E., and Kaizerman, S. (1987). 
“Sensitivity analysis of two-phase flow problems”. Nuclear 
Science and Engineering, 99 (1), 53-81. Toomarian, N. and 
Barhen, J. (1991). “Adjoint operators and non-adiabatic 
algorithms in neural networks”. Applied Mathematical 
Letters, 4 (2), 69-73. 

Werbos, P. J. (1990). “Backpropagation through time: 
what it does and how to do it”, Proceeding of the IEEE, 87 
15 ( 10 ). 

Williams, R. J., and Zipser, D. (1988). “A learning algo- 
rithm for continually running fully recurrent neural net- 
works”. Technical Report ICS Report 8805, UCSD, La Jolla, 
Calif. 92093. 

20 Williams, R. J., and Zipser, D. (1989). “A learning algo- 
rithm for continually running fully recurrent neural 
networks”, Neural Computation, 1 (2), 270-280. 

Zak, M. (1989). “Terminal attractors in neural networks”. 
Neural Networks, 2 (4), 259-274. 

25 1. INTRODUCTION 

Recently, there has been a tremendous interest in devel- 
oping learning algorithms capable of modeling time- 
dependent phenomena (Grossberg, 1987; Hirsh, 1989). In 
particular, considerable attention has been devoted to cap- 
30 turing the dynamics embedded in observed temporal 
sequences (e.g., Narendra, 1990; Parlos et al., 1991). 

In general, the neural architectures under consideration 
may be classified into two categories: 

Feedforward networks, in which back propagation 
35 through time (Werbos, 1990) can be implemented. This 
architecture has been extensively analyzed, and is 
widely used in simple applications due, in particular, to 
the straightforward nature of its formalism. 

Recurrent networks, also referred to as feedback or fully 
40 connected networks, which are currently receiving 
increased attention. A key advantage of recurrent net- 
works lies in their ability to use information about past 
events for current computations. Thus, they can provide 
time-dependent outputs for both time-dependent as 
45 well as time -independent inputs. 

One may argue that, for many real world applications, the 
feedforward networks suffice. Furthermore, recurrent net- 
work can, in principle, be unfolded into a multilayer feed- 
forward network (Rumelhart et al. 1986). A detailed analysis 
50 of the merits and demerits of these two architectures is 
beyond the scope of this paper. Here, we will focus only on 
recurrent networks. 

The problem of temporal learning can typically be for- 
mulated as a minimization, over an arbitrary but finite time 
55 interval, of an appropriate error functional. The gradients of 
the functional with respect to the various parameters of the 
neural architecture, e.g., synaptic weights, neural gains, etc. 
are essential elements of the minimization process and, in 
the past, major efforts have been devoted to the efficacy of 
60 their computation. Calculating the gradients of a system’s 
output with respect to different parameters of the system is, 
in general, of relevance to several disciplines. Hence, a 
variety of methods have been proposed in the literature for 
computing such gradients. A recent survey of techniques 
65 which have been considered specifically for temporal learn- 
ing can be found in Pearlmutter (1990). We will briefly 
mention only those which are relevant to our work. 
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Sato (1990) proposed, at the conceptual level, an algo- 
rithm based upon Lagrange multipliers. However, his algo- 
rithm has not yet been validated by numerical simulations, 
nor has its computational complexity been analyzed. Will- 
iams and Zipser (1989) presented a scheme in which the 5 
gradients of an error functional with respect to network 
parameters are calculated by direct differentiation of the 
neural activation dynamics. This approach is computation- 
ally very expensive and scales poorly to large systems. The 
inherent advantage of the scheme is the small storage 10 
capacity required, which scales as 0(N 3 ), where N denotes 
the size of the network. 

Pearlmutter (1989), on the other hand, described a varia- 
tional method which yields a set of linear ordinary differ- 
ential equations for backpropagating the error through the 15 
system. These equations, however, need to be solved back- 
wards in time, and require temporal storage of variables 
from the network activation dynamics, thereby reducing the 
attractiveness of the algorithm. Recently, Toomarian and 
Barhen (1991) suggested a framework which, in contradis- 20 
tinction to Pearlmutter’s formalism, enables the error propa- 
gation system of equations to be solved forward in time, 
concomitantly with the neural activation dynamics. A draw- 
back of this novel approach came from the fact that their 
equations had to be analyzed in terms of distributions, which 25 
precluded straightforward numerical implementation. 
Finally, Pineda (1990) proposed combining the existence of 
disparate time scales with a heuristic gradient computation. 
The underlying adiabatic assumptions and highly approxi- 
mate gradient evaluation technique, however, placed severe 30 
limits on the applicability of his method. 

SUMMARY OF THE INVENTION 

The invention trains a neural network using a method for 
calculating the gradients of an error functional with respect 35 
to the system’s parameters, which builds upon advances in 
nonlinear sensitivity theory (Oblow 1978, Cacuci 1981). In 
particular, it exploits the concept of adjoint operators to 
reduce the computational costs. Two novel systems of 
equations for error propagation (i.e., the adjoint equations), 40 
are at the heart of the computational framework. These 
equations are solved simultaneously (i.e., forward in time) 
with the network dynamics. The computational complexity 
of the algorithm scales as 0(N 3 ) per time step, the storage 
requirements are minimal i.e., of the order of 0(N 2 ), while 45 
complication arising from the presence of distributions in 
our earlier framwork are avoided. In the remainder of this 
specification, the terms sensitivity and gradient will be used 
interchangeably. 

The method of the invention trains a neural network so 50 
that a neuron output state vector thereof obeys a set of 
forward sensitivity equations over a finite repeatable learn- 
ing period, the method including setting the neuron output 
state vector to zero at the beginning of the learning period, 
defining first and auxiliary adjoint systems of equations 55 
governing an adjoint function and an auxiliary adjoint 
function, respectively, of the neural network and corre- 
sponding to the forward sensitivity equations, setting the 
adjoint function to zero at the beginning of the learning 
period and integrating the adjoint system of equations for- 60 
ward in time over the learning period to produce a first term 
of an indirect effect of a sensitivity gradient of the neural 
network, setting the auxiliary adjoint function to zero at the 
end of the learning period and integrating the auxiliary 
adjoint system of equations forward in time over the learn- 65 
ing period to produce a remaining term of the indirect effect, 
computing a sum of the first and remaining terms, multi- 


plying the sum by a learning rate and subtracting the product 
thereof from a current neuron parameter vector to produce 
an updated neuron parameter vector. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a diagram of a recurrent neural network 
employed in carring out the invention. 

FIG. 2 is a block diagram illustrating architecture which 
performs the process embodying the present invention for 
training in neural network by integration of adjoint systems 
of equations forward in time. 

FIG. 3, 4 and 5 illustrate different simulation results of a 
neural network learning a circular motion using the inven- 
tion. 

FIG. 6 is a graph of the error as a function of the number 
of learning iterations for each of the cases illustrated in 
FIG.’s 3-5. 

FIG.’s 7, 8 and 9 illustrate different simulation results of 
a neural network learning a figure-eight motion using the 
invention. 

FIG. 10 is a graph of the error as a function of the number 
of learning iterations for each of the cases illustrated in 
FIG.’s 7-9. 

DETAILED DESCRIPTION OF THE 
INVENTION 

2. TEMPORAL LEARNING FRAMEWORK 

We formalize a neural network as an adaptive dynamical 
system whose temporal evolution is governed by the fol- 
lowing set of coupled nonlinear differential equations: 


: <?n| 


y » Z 


t> 0 


(i) 


where u„ represents the output of the nth neuron (u„(0) being 
the initial state), and T nm denotes the strength of the synaptic 
coupling from the m-th to the n-th neuron. The constants k n 
characterize the decay of neuron activities. The sigmoidal 
functions g„(.) modulate the neural responses, with gain 
given by y n ; typically, g n (y n %)=tm h(y n %). In order to imple- 
ment a nonlinear functional mapping from an 
N 7 -dimensional input space to an N 0 -dimensional output 
space, the neural network is topographically partitioned into 
three mutually exclusive regions. As shown in FIG. 1, the 
partition refers to a set of input neurons S 7 , a set of output 
neurons S 0 , and a set of “hidden” neurons Note that this 
architecture is not formulated in terms of “layers”, and that 
each neuron may be connected to all others including itself. 

Let a(t) (in the sequel the overhead bar will denote a 
vector) be an N-dimensional vector of target temporal 
patterns, with non zero elements, a„(t), in the input and 
output sets only. When trajectories, rather than mappings, 
are considered, as we shall see in the sequel, components in 
the input set may also vanish. Hence, the time -dependent 
external input term in Eq. (1), i.e., I n (t), encodes component - 
contribution of the target temporal pattern via the expression 

( a n (t) if neS/ @) 

U0 = \ 

[0 if n e Sh fj Sq 


To proceed formally with the development of a temporal 
learning algorithm, we consider an approach based upon the 
minimization of an error functional, E, defined over the 
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learning period or time interval [t 0 , ty] by the following 
expression 


E(u, p ) = 





where the error component, e n (t), represents the difference 
between the desired and actual value of the output neurons, 
i.e., io 


a n (t)-u n (t) if ne S 0 

e n {t) = \ 

[O if ne.S( [JS H 


(4) 


15 


In our model, the internal dynamical parameters of inter- 
est are the strengths of the synaptic interconnections, T nm , 
the characteristic decay constants, k n , and the gain 
parameters, y n . They can be represented as a neuron param- 20 
eter vector of M [M=N 2 +2N] components 

P = {Tn, ■ ■ ■ ? m k 1; ■ ■ ■ > , y N } (5) 

We will assume that elements of p are statistically indepen- 2 5 
dent. Furthermore, we will also assume that, for a specific 
choice of parameters and set of initial conditions, a unique 
solution of Eq. (1) exists. Hence, the state variables u (which 
may be referred to as the neuron output state vector) are an 
implicit function of the parameters p. In the rest of this 30 
paper, we will denote the juth element of the vector p by 
(where the neuron parameter index pt=l, . . . , M). 

Traditionally, learning algorithms are constructed by 
invoking Lyapunov stability arguments, i.e., by requiring 
that the error functional be monotonically decreasing during 35 
learning time, t. This translates into 

M (6) 

dE dE dpp 

dr dpp dr 4n 

(i=i 


One can always choose, with a learning rate r|>0 


dpf, dE (7) 45 

dr ^dpp 


which implements learning in terms of an inherently local 
minimization procedure. Attention should be paid to the fact 50 
that Eqs. (1) and (7) may operate on different time scales 
(i.e., the neural network behavior time t of Equation 1 and 
the neural adaptation or learning time t of Equations 6 and 
7), with parameter adaptation occurring at a slower pace. 
Integrating the dynamical system, Eq. (7), over the interval 55 
[x, t+At], one obtains, 


dE 

pf dF 

pf dF 

Pf dF 

du 

(9) 

= 

dt = 

dt + 

— ■ 

dt 

d p [i » 

'to d Pn 

'to d Pf 

L an 

dPn 



This sensitivity expression has two parts. The first term in 
the Right Hand Side (RHS) of Eq. (9) is called the “direct 
effect”, and 

corresponds to the explicit dependence of the error func- 
tional on the system parameters. The second term in the RHS 
of Eq. (9) is referred to as the “indirect effect”, and corre- 
sponds to the implicit relationship between the error func- 
tional and the system parameters via u. In our learning 
formalism, the error functional, as defined by Eq. (3), does 
not depend explicitly on the system parameters; therefore, 
the “direct effect” vanishes, i.e., 

dF (10 a) 

=0 

dPa 


Since F is shown analytically (viz. Eqs. (3) and (4)), 
computation of <9F/chi is straightforward. Indeed 



Thus, to enable evaluation of the error gradient using Eq. 
(9), the “indirect effect” matrix c)u/3p should, in principle, be 
computed. In the sequel, we shall see that this is rather 
expensive from an algorithmic (i.e., computational 
complexity) perspective, but that an attractive alternative, 
based on the concept of adjoint operators, exists. First, 
however, we introduce the notion of teacher forcing. 

3. TEACHER FORCING 

A novel neural network “teacher forcing” training method 
is described in co-pending patent application Ser. No. 
07/908,677 filed Jun. 29, 1992 by the inventors herein and 
entitled “Fast Temporal Neural Learning Using Teacher 
Forcing”. As indicated in FIG. 2C of the above -referenced 
co-pending application, the parameters of the network are 
updated based upon the error accumulated over the length of 
the trajectory or learning period. This method is employed in 
numerical simulations of the present invention described in 
detail below. The present invention may be combined with 
such teacher forcing if desired, or the present invention may 
be practised without teacher forcing. In order to incorporate 
this teacher forcing into the neural learning formalism 
presented earlier, the time -dependent input to the neural 
activation dynamics, Eq. (1), i.e., I„(t) as given by Eq. (2), 
is modified to read: 


a n (t) 


if n e S( 


/»(0 = 1 0 


if n e S H 


(Ak(r)] 1 p [a n {t)-u n {t)f ifne S 0 


( 11 ) 


p+Ar dE (8) 

p ti (r+ A t) = p^-T] \ - — dr 

1 Jr P ji 


Equation (8) implies that, in order to update a system 
parameter p^, one must evaluate the “sensitivity” (i.e., the 
gradient) of E, Eq. (3), with respect to in the interval [t, 
t+At]. Furthermore, using Eq. (3) and observing that the 
time integral and derivative with respect to p^ commute, one 
can write 


At this state, X and (3 are assumed to be positive constants. 
The purpose of the term [a„(t)] 1_p is to insure that I„(t) has 
60 the same dimension as a n (t) and u n (t). Zak (1989) has 
demonstrated that in general, for |3=(2i+l)/(2j+l), i<j and i 
and j strictly positive integers, an expression of the form 
[a w -u„] p induces a terminal attractor phenomenon for the 
dynamics described in Eq. (1). Generally, |3=7/9for the 
65 numerical simulations reported below in this specification. 

When learning is successfully completed, [i.e., e n (t)=0] 
teacher forcing will vanish, and the network will revert to 
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the conventional dynamics given by Eqs. (1) and (2). As 
described in the above-referenced co-pending application, 
we suggest that X be modulated in time as a function of the 
error functional, according to 

X(x)=l-e~ E ^ (12) 

The above expression should be understood as indicating 
that, while X varies on the learning time scale, it remains at 
essentially constant levels during the iterative passes over 
the interval [t 0 , ty]. 

4. GRADIENT COMPUTATION ALGORITHMS 

The efficient computation of system response sensitivities 
(e.g., error functional gradients) with respect to all param- 
eters of a network’s architecture plays a critically important 
role in neural learning. In this section, we will first review 
two of the best currently available methods for computing 
these sensitivities, including an explicit approach for calcu- 
lating the matrix du/dp, and an alternative approach, based 
upon the concept of adjoint operators, which involves error 
back propagation through time. This will be followed by the 
details of a new method, which enables an efficient compu- 
tation of the sensitivities by solving two systems of adjoint 
equations forward in time. 

4.1 State-of-the-art Methods 

4.1.1 DIRECT APPROACH 

Let us differentiate the activation dynamics, Eq. (1), 
including the teacher forcing, Eq. (11), with respect to p^. 
We observe that the time derivative and partial derivative 
with respect to p M commute. Using the shorthand notation d( 
. • • )/3 p M -( ■ ■ . ),82 we obtain a set of equations to be referred 
to in the sequel as “Forward Sensitivity Equations” (FSEs): 


8 

total number of multiply-accumulates scales like N 4 L. 
Clearly, such a scheme exhibits expensive scaling 
properties, and would not be very practical for large net- 
works. On the other hand, since the FSEs are solved forward 
5 in time, along with the neural dynamics, the method also has 
inherent advantages. In particular, there is no need for a large 
amount of memory. Since u n has N 3 +2N 2 components, the 
storage requirement scales as 0(N 3 ). 

4.1.2 INDIRECT APPROACH 

10 In order to reduce the computational complexity associ- 
ated with the above technique for evaluating the “indirect 
effect” term in Eq. (9), an alternative approach can be 
considered. It is based upon the concept of adjoint operators, 
and eliminates the need to compute explicitly u, . The 
critical feature of this approach is that a single vector of 
adjoint functions, v, is obtained, by solving an 
N-dimensional system of equations once, not M times as 
previously. This vector contains all the information required 
for computing the sensitivities of the error functional, 
dE/dp^, with respect to all parameters, p . The necessary and 
sufficient conditions for constructing adjoint equations are 
discussed in Cacuci (1981). Adjoint equations can be 
derived in a number of manners, including variational, 
perturbation theoretic, differential and functional analytic 
3 techniques. Details of derivations, based upon the differen- 
tial approach, for example, can be found in Toomarian 
(1987) and Maudlin et al. (1980). 

It can be shown that an Adjoint System of Equations 
30 (ASEs) pertaining to the FSEs, Eq. (13), can be formally 
written as 


Mn,n + ^ nm u m,fi ~ ^n,fi t > 0 

m 

u„ ,, = 0 t = 0 


( 13 ) 


in which 

= (*„ - Jng'rfi l n / & wJA™ “ 7ng' n T nm 
S n ,n = + 7 ng’nYj U m^p jJL ,T nm + g' n \ ^ T nm U m + I n L 

m V m ) 


( 14 ) 

( 15 ) 


In the above expressions, g 1 ^ represents the derivative of g n 
with respect to its arguments, 6 denotes the Kronecker 
symbol and S n is defined as a nonhomogeneous “source”. 
The source term contains all explicit derivatives of the 
neural activation dynamics, Eq. (1), with respect to the 
system parameters, p^. Hence, it is parameter dependent and 
its size is (N%M). The initial conditions of the activation 
dynamics, Eq. (1), are excluded from the vector of system 
parameters p. Thus, the initial conditions of the FSEs will be 
taken as zero. Their solution will provide the matrix du/dp 
needed for computing the “indirect effect” contribution to 
the sensitivity of the error functional, as specified by Eq. (9). 
This algorithm is, essentially, similar to the scheme pro- 
posed by Williams and Zipser (1989). Computation of the 
gradients using the forward sensitivity formalism would 
require solving Eq. (13) M times, since the source term, S n ^, 
explicitly depends on p^. This system has N equations, each 
of which requires multiplication and summation over N 
neurons. Hence, the computational complexity, measured in 
terms of multiply-accumulates, scales like N 2 per system 
parameter, per time step. Let us assume, furthermore, that 
the interval [t 0 , ty] is discretized into L time steps. Then, the 


-V„ + Yj A lm V m =S’ n I > 0 

m 

35 


( 16 ) 


where the superscript “T” denotes transposition of the ASE 
matrix and the indices n and m range from 1 to N. In 
order to specify Eq. (16) in closed mathematical form, we 
4Q must define the source term, S*„, and the initial conditions 
for the system. (An initial condition may apply to the 
beginning of the learning period or the end of the learning 
period.) Both should be independent of u^ and its deriva- 
tives. As we will see in the sequel, a judicious choice of the 
45 source term, S*„, and of the initial conditions for the ASEs, 
Eq. (16), forms the cornerstone of the indirect methods. 
Generally, the source term, S*„, is selected in such a way, as 
to make its inner product with u^ identical to the “indirect 
effect” contributions to the sensitivities of the error func- 
5Q tional. Selection of the time (initial or final) at which the 
initial conditions are specified will, on the other hand, 
dictate the direction in time in which the ASEs should be 
integrated. For instance, if the set of initial condition is 
specified at t 0 , the ASEs will be integrated forward in time. 
55 On the other hand, if the initial conditions are given at t f , the 
ASEs must be integrated backward in time. In the remainder 
of section 4.1, we will derive and discuss the advantages and 
disadvantages of two algorithms, which integrate the ASEs 
backward and forward in time, respectively. 

60 a — Integration of the ASEs backward in time 

In order to construct an expression which can be used to 
eliminate the dependence of the “indirect effect” term of the 
sensitivities on u^, we have to form the inner product of the 
vectors S* and u^. This is done by multiplying the FSEs, Eq. 
65 (13), by \ T , and the ASEs, Eq. (16), by u /4 r , and by 
subtracting the two resulting equations. Then, integrating 
over the time interval [t 0 , ty], yields the bilinear form: 
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[v T u^] if - [v T u,^ = £ f (v T S,„)d, - £ f (u^r)dt 


(17) 


In the above equation denotes the inhomogeneous source 
term of the FSEs, Eq. (15), and [. . . ] r denotes the value of 
the expression in brackets evaluated at time t. By identifying 

S*„=dF/du„ (18a) 

the second integral in the RHS of Eq. (17) is seen to become 
identical to the “indirect effect” term in Eq. (9). Incorporat- 
ing the initial conditions of Eq. (13) into Eq. (17), we obtain: 


Is it possible to overcome the rather severe memory 
limitations of the adjoint method, while keeping its compu- 
tational benefits? We notice that Eq. (16) is linear in the 
variables v. Therefore, it is, in principle, possible to obtain 
5 identical contributions to Eq. (9) with an alternative choice 
for adjoint source and initial conditions. Indeed, let us select: 


_ dF 

S = -- vS(t-t f ) 
du 

10 

and 


(21a) 


v(t = 0) = 0 


(21b) 



dt- [v 7 ^]^ 


( 19 ) 15 where 6(t— t^) is the Dirac distribution. Inserting Eqs. (21) 
into Eq. (17) and recalling that, by definition of 8 


How can we eliminate the undesirable presence of u^ from 
the RHS of Eq. (19)? The clear choice is 20 


J tf u^vS(t-t f )dt= [v T u, M \ t=t ^ 


(22) 


v(t=t / )=0 


(18b) we obtain 


resulting in the fundamental identity 



dr 


25 


( 20 ) 


Since the ASEs, (Eqs. 16, and 18), and S^ (which is known 
by definition) are independent of u^, the RHS of Eq. (20) is 30 
independent of u ju. Hence, by solving the ASEs once only, 
the “indirect effect” is evaluated via the RHS of Eq. (20) for 
all system parameters, p. Thus, the above identity provides 
the basis for a computationally efficient algorithm for evalu- 
ating the gradient of the error functional. It is important to 35 
notice that since the initial conditions, Eq. (18b), for the 
ASEs, were given at trajectory end time, i.e., at t=t /? Eq. (16), 
is integrated backward in time, i.e., from t=t f to t=t 0 . 

We note that the above paradigm results in equations 
similar to those obtained by Pearlmutter (1989) using a 40 
variational approach. Both algorithms require that th neural 
activation dynamics, Eq. (1), be first solved forward in time 
(to provide quantities such as 


dF 
du ’ 


45 


S^) followed by a solution of the ASEs, Eq. (16), integrated 
backward in time. During the backward integration, the 50 
product of the vector v and matrix S^ is calculated and also 
integrated from t=t f to t=t 0 . The result of this integration 
provides the “indirect effect” contribution to the gradients of 
the error functional. 

Since u n (t) and v n (t) can be obtained at the cost of N 55 
multiply-accumulates per time step, and there are N equa- 
tions (neurons) per network, the complexity of this approach 
scales like N 2 L. Thus, the principal advantage of adjoint 
methods lies in the dramatic reduction of computational 
costs (e.g., at least 0(N 2 ) for an N-neuron network). 60 
However, a major drawback to date has come from the 
necessity to store quantities such as g 1 , S* and S at each 
time step. Hence, the memory requirements for this method 
scale as N 2 L. Note also that the actual evaluation of the RHS 
of Eq. (9) requires N 2 L multiply-accumulates, independent 65 
whether the FSEs or ASEs are used, 
b — Integration of the ASEs forward in time 



(23) 


This expression is identical to Eq. (20). Therefore, most 
items discussed in connection with this equation will hold. 
However, v is now the solution of the ASEs defined by Eqs. 
(16) and (21). In contradistinction to the previous approach, 
the initial conditions for the ASEs are given here at t=t 0 , Eq. 
(21b). Therefore, the ASEs can be integrated forward in 
time, concomitantly with the neural activation dynamics. 
Potentially, storage requirements are reduced by a large 
amount, since the quantities S* and g 1 are now immediately 
used at each time step to calculate v and its product with the 
known matrix S . Hence, the memory required scales only 
as 0(N 2 ). The computational complexity of this method also 
scales as 0(N 2 L). A potential drawback, however, lies in the 
fact that Eq. (16) now contains, from Eq. (21a), the Dirac 
distribution, 6(t— t^), operating directly on the adjoint 
functions, v. This precludes straightforward (stable) numeri- 
cal or VLSI implementation of such a scheme. 

4.2 The New Approach 

At this stage, we introduce a new paradigm which will 
enable us to evolve the adjoint dynamics, Eq. (16) forward 
in time, but without the difficulties associated with the above 
formalism. Consider, again, the bilinear form, Eq. (17), 
associated with the FSEs, Eq. (13), and the ASEs, Eq. (16). 
Let us select 

_ dF (24a) 

r = — 

du 


This is similar to the choice made earlier, when discussing 
the integration of the ASEs backward in time. The second 
integral in the RHS of Eq. (17) will again be identical to the 
“indirect effect” contribution to the sensitivity of the error 
functional, Eq. (9). But, in contradistinction to Eq. (18b), we 
now select as initial conditions: 

v(t=0)=0 (24b) 

This will allow us to integrate the resulting ASEs, i.e., Eqs. 
(16) and (24), forward in time. Combining Eqs. (9), (17) and 
(24), we obtain the following expression: 
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f f (S') T u tll dt = <tt- [v r «4 


(25) 


(^ +1 - 2 *) 


Ar 


+ [A r ]V = 0 


0 < l< L 


(31) 


The first term in the RHS of Eq. (25) can be computed by 
using the values of v resulting from the solution of Eqs. (16) 
and (24) forward in time. The main difficulty resides with the 
evaluation of the second term in the RHS of Eq. (25), i.e., 
[v 7 ^]^- To compute it, we now introduce an auxiliary 10 
adjoint system, formally similar to Eq. (16): 


where, superscript 1 implies that the numerical values of the 
quantities of interest at time step 1 are used and At denotes 
the size of the integration time step. Here, [A 7 ] 1 denotes the 
ASE matrix evaluated at the time step 1. From the above 
equation one can easily show that 

~z 1+1 =B l -B l ~ 1 . . . B 1 Bfyt t )=B n z(Q (32) 


in + ) ’ A nm Zm — S n t > 0 

m 


(26) where 

15 B l =I+d4A r i(l=0,l, . . . , L- 1) 


(33) 


Once again, the inner product of u and "S is required. The 
bilinear form associated with u^ and z is obtained in a 
similar fashion to the derivation of Eq. (17). Its expression 
is: 


and I denotes the identity matrix. The term B l may be 
referred to as a propagation kernel. Using a numerical 
approximation for the integral, the RHS of Eq. (29) can be 
recast as 


[z T u^] tf - k r «,^] fo 


£ 


7 T- 

(fs,. 


, )dt 



(27) 


IV = i ( t 0 ) 


I> 7 


A t 


(34) 


25 

By choosing 

~s=v{i)b{t-t f ) (28a) 

the last term of Eq. (27) reduced to [vu j f/ , which is the 30 
quantity of interest. If we furthermore select 

i(V)=0 (28b) 

and take into account the initial value of u M , Eq. (27) yields 35 


Algorithmically, the computation of [v 7 !^], can be 
described as follows. At each time step 1, the values of the 
matrices and S J, are calculated using Eqs. (14, 15 

and 33). The needed value of B (/_1)! is computed by multi- 
plying the stored value of B (/_2)! by B (/-1) . The result is not 
only stored for the next iteration, but also used to calculate 
the product of B (/_1)! by S J which, in turn, is added up. The 
initial conditions z(t 0 ) can easily be found at time ip i.e., at 
iteration step L, by solving the system of linear algebraic 
equations: 


f f (uli)d,= [#«,„] = f f (fs v) d, (29) 

Note that, even though we selected z(t / )=0, we are still 
interested in solving the auxiliary adjoint system, Eqs. (26) 
and (28), forward in time. Thus, the critical issue is how to 
select initial condition (i.e., z(t 0 )), that would result in 

z(V)=°- 

Let us first provide a simple qualitative illustration of how 
this problem can be approached. We assume, for the 
moment, that the matrix A in Eq. (26) is time independent. 
The formal solution of Eqs. (26) and (28) can then be written 
as: 

~z{t)=^ t ~ t ^~z(Qt<t f (30a) 

(30b) 

In principle, using Eq. (30a), the RHS of Eq. (29) can be 
expressed in terms of z(t 0 ). At time ip where v(t^) is known 
from the solution of Eqs. (16) and (24), one can calculate the 
vector z(t 0 ), from Eq. (30b), with z(t / )=0. 

In the problem under consideration, however, the ASE 
matrix Ain Eq. (26) is time dependent, (viz Eq. (14)). Thus, 
it is practical to integrate numerically the auxiliary adjoint 
equations. Usually, the same numerical scheme used for Eqs. 
(1) and (16) should be adopted. For illustrative purposes 
only, we limit the discussion in the sequel to a first order 
finite differences approximation, i.e., we rewrite Eqs. (26, 
28) as 


B^- 1 ) ! z(t 0 )=v(t / ) (35) 

To summarize, in this new method the computation of the 
gradients of the error functional (i.e., Eq. (9)), involves two 
40 stages, corresponding to the two terms in the RHS of Eq. 
(25). The first term is calculated from the adjoint functions, 
v, obtained by integrating Eqs. (16) and (24) forward in time 
along with the neural activation dynamics, Eq. (1). The 
corresponding computational complexity is 0(N 2 L). The 
45 second term is calculated via Eqs. (34-35), and involves two 
steps: a) kernel propagation, Eq. (34), which requires mul- 
tiplication of two matrices B 1 and B (/_1)! at each time step; 
hence, its computational complexity scales as N 3 L; b) solu- 
tion of the linear algebraic system (35) with computational 
50 complexity of 0(N 2 ). Thus, the overall computational com- 
plexity of this approach is 0(N 3 L). Notice, however, that 
here the storage needed is minimal and equal to 0(N 2 ). 

An architecture for performing the foregoing process is 
illustrated in the block diagram of FIG. 2. Each block in FIG. 
55 2 may be thought of as a separate dedicated processor, 
although the entire process is implementable with a single 
processor using the program of Appendix A. Referring to 
FIG. 2, the first step is to set the source term to the partial 
derivative of the error functional F with respect to the neuron 
60 output state vector and to set the adjoint function at the 
beginning of the learning period to zero (block 200). Then, 
the adjoint system of equations (Equation 16) is integrated 
forward in time (block 205) to produce the first of two 
components of the indirect effect of the sensitivity gradient 
65 of equation (9) (block 210 of FIG. 2) and the adjoint function 
evaluated at the end of the learning period (block 215). The 
next process includes blocks 220 through 265 of FIG. 2 and 
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corresponds to an integration of the auxiliary adjoint system 
of equations (Equation 26) forward in time and is performed 
contemporaneously with the integration step of block 205 . 
In block 220 , the integration of the auxiliary ASE’s over the 
learning period is divided into L time steps and the time step 
index 1 is set to zero (block 225 ). At each time step, the 
propagation kernel B is computed (block 230 ) and the time 
step index is incremented (block 235 ). The kernels are 
propagated at each time step by multiplying the current 
kernel by the product of the kernels of all previous time steps 
at block 240 and multipled by the derivative of the source 
term with respect to neuron parameters (block 245 ) and the 
result is summed over all time steps (block 250 ). The result 
of block 240 corresponding to the final time step is taken in 
block 250 for use in solving the system of Equation (35). 
This solution is indicated in FIG. 2 as a multiplication (block 
260 ) of the inverse of the propagation kernel of block 255 by 
the adjoint function at the end of the learning period (of 
block 215 ). However, it is understood that well-known 
iterative methods may be employed in solving the system of 
Equation 35 rather than inversing the propagation kernel. 
Finally, the results of blocks 250 and 260 are multiplied 
together (block 265 ) to produce the remaining component of 
the indirect effect of the sensitivity gradient of Equation 
(19). This remaining component (i.e., the product of block 
265 ) is added at block 270 to the first component of the 
indirect effect (from block 210 ), the resulting sum is mul- 
tiplied by the learning rate (block 275 ) and subtracted from 
the current parameter vector (block 280 ) to produce an 
updated parameter vector (block 285 ). Then, the time t is 
reset and the learning time t is incremented (block 290 ) and 
the entire process repeated. 

As a final remark, we wish to consider a further 
approximation, based upon the requirement of small At. The 
matrices B l , Eq. (33), become then diagonally dominant, and 
B /! can be approximated at each time step 1 by 

B l ' = i + At^A 1 ( 36 ) 


This implies that the computation al complexity of the 
proposed method could be further reduced to 0(N 2 L) for 
certain class of trajectories. At this stage such an approxi- 
mation is merely an idea which has to be sustained via 
simulations. 

5. Numerical Simulations 

The new learning paradigm, presented in the preceding 
section, has been applied to the problem of learning two 
trajectories: a circle and a figure eight. Results referring to 
these problems can be found in the literature (Pearlmutter 
1989), and they offer sufficient complexity for illustrating 
the computational efficiency of our proposed formalism. 

The network that was trained to produce these trajectories 
involved 6 fully connected neurons, with no input, 4 hidden 
and 2 output units. An additional “bias” neuron was also 
included. In our simulations, the dynamical systems were 
integrated using a first order finite difference approximation. 
The sigmoidal nonlinearity was modeled by a hyperbolic 
tangent. Throughout, the decay constants K n , the neural 
gains y„, and X were set to one. Furthermore, (3 was selected 
to be 7/9. For the learning dynamics , Ax was set to 6.3 and 
r| to 0.015873. The two output units were required to 
oscillate according to 

a 5 (t)=A sin wt (36a) 

a 6 (t)=A cos wt (36b) 
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for the circular trajectory, and, according to 

a 5 (t)=A sin wt (37a) 

a 6 (t)=A sin 2 wt (37b) 

5 

for the figure eight trajectory. Furthermore, we took A=0.5 
and co=l. Initial conditions were defined at t o =0. Plotting a 5 
versus a 6 produces the “desired” trajectory. Since the period 
of the above oscillations is 2jt, ty=2jt time units are needed 
10 to cover one cycle. We selected At=0.1, to cover one cycle 
in approximately 63 time steps. 

5.1 Circular Trajectory 

In order to determine the capability and effectiveness of 
the algorithm, three cases were examined. As initial 
15 conditions, the values of u„ were assumed to be uniform 
random numbers between -0.01 and 0.01 for the simulation 
studies referred in the sequel as “Case - 1” and “Case - 2”. 
For Case - 3, we set u„ equal to zero, except u 6 which was 
set to 0.5. The synaptic interconnections were initialized to 
20 uniform random values between -0.1 and +0.1 for all three 
experiments. 

CASE - 1 

The training was performed over t^b.5 time units (i.e., 65 
time intervals). A maximum number of 500 iterations was 
25 allowed. The results shown in FIG. 3 were obtained by 
starting the network with the same initial conditions, u„(0), 
as used for training, the learned values of the synaptic 
interconnections, T nm , and with no teacher forcing (X=0). As 
we can see, it takes about 2 cycles until the network reaches 
30 a consistent trajectory. Despite the fact that the system’s 
output was plotted for more than 15 cycles, only the first 2 
cycles can be distinguished. FIG. 6 demonstrates that most 
of the learning occurred during the first 300 interations. 
CASE - 2 

35 Here, we decided to increase the length of the trajectory 
gradually. A maximum number of 800 learning iterations 
was now allowed. The length of the training trajectory was 
65 time intervals for the first 100 iterations, and increased 
every 100 iterations by 10 time intervals. Therefore, it was 
40 expected that the error functional would increase whenever 
the length of the trajectory was increased. This was indeed 
observed, as may be seen from the learning graph, shown in 
FIG. 6. The output of the trained network is illustrated in 
FIG. 4 . Here again, from 15 recall cycles, only the first two 
45 (needed to reach the steady orbit) are distinguishable and the 
rest overlap. Training using greater trajectory lengths 
yielded a recall circle much closer to the desired one than in 
the previous case. From FIG. 6, one can see that the last 500 
iterations did not enhance dramatically the performance of 
50 the network. Thus, for practical purposes, one may stop the 
training after the first 300 iterations. 

CASE - 3 

The selection of appropriate initial conditions for u„ plays 
an important role in the effectiveness of the learning. Here, 
55 all initial values of u„ were selected to be exactly zero except 
the last unit, where u 6 =0.5 was chosen. This correspond to 
an initial point on the circle. The length of the trajectory was 
increased successively, as in the previous case. In spite of the 
fact that we allowed the system to perform up to 800 
60 iterations, the learning was essentially completed in about 
200 iterations, as shown in FIG. 6. The results of the 
network’s recall are presented in FIG. 5 , which shows an 
excellent match. 

5.2 Figure Eight Trajectory 

65 For this problem, the synaptic interconnections were 
initialized to uniform random values between -1 and 30 1. 
As initial conditions, the values of u„ were assumed to be 
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uniform random numbers between -0.01 and 0.01. The 
following three situations were examined. 

CASE - 4 

The training was performed over ty=6.5 time units (i.e., 65 
time intervals). A maximum number of 1000 iterations was 5 
allowed. The results shown in FIG. 7 were obtained by 
starting the network with the same initial conditions, u n (0), 
as used for training, the learned values of the synaptic 
interconnections, T nm , and with no teacher forcing (X=0). As 
we can see, it takes about 3 cycles until the network reaches 10 
a consistent trajectory. Despite the fact that the system’s 
output was plotted for more than 15 cycles, only the first 3 
cycles can be distinguished. 

CASE - 5 

Here, we again decided to increase the length of the 15 
trajectory gradually. A maximum number of 1000 iterations 
was now allowed. The length of the training trajectory was 
65 time intervals for the first 100 iterations, and was 
increased every 100 iterations by 5 time intervals. Therefore, 
it was again expected that the objective functional would 20 
increase whenever the length of the trajectory was increased. 
This was indeed observed, as may be seen from the learning 
graph, shown in FIG. 10. The output of the trained network 
is illustrated in FIG. 8. Here again, from 15 recall cycles, 
only the first three (needed to reach the steady orbit) are 25 
distinguishable, and the rest overlap. As a direct result of 
training using greater trajectory lengths, orbits much closer 
to the desired one than in the previous case were obtained. 
CASE - 6 

The learning in this case was performed under conditions 30 
similar to CASE - 5, but with the distinction that X was 
modulated according to Eq. (12). The results of the net- 
work’s recall are presented in FIG. 9, and demonstrate a 
dramatic improvement with respect to the previous two 
cases. 35 

It is important to keep in mind the following observations 
with regard to the simulation results: 

1) For the circular trajectory, X was kept constant through- 
out the simulations and not modulated according to Eq. (12). 

As we can see from FIG. 6, in cases 1 and 2, the error 40 
functional was not reduced to zero. Hence, a discrepancy in 
the functional form of the neural activation dynamics used 
during the learning and recall stages occurred. This was a 
probable cause for the poor performance of the network. In 
case 3, however, the error functional was reduced to zero. 45 
Therefore, the teacher forcing effect vanished by the end of 
the learning. 

2) For the figure eight trajectory, the differences between 
cases 5 and 6 lies in the modulation of X, (i.e., the amplitude 

of the teacher forcing). Even though in both cases the error 50 
functional was reduced to a negligible level, the effect of the 
teacher forcing in case 5 was not completely eliminated over 
the entire length of the trajectory. This points toward the fact 
that modulation of X not only reduces the number of 
iterations but also provides higher quality results. 55 

While the invention has been described detail by specific 
reference to preferred embodiments, it is understood that 
variations and modifications thereof may be made without 
departing from the true spirit and scope of the invention. 

What is claimed is: 60 

1. A method of training a neural network so that a neuron 
output state vector thereof obeys a set of forward sensitivity 
equations over a finite learning period, said method com- 
prising: 

defining first and auxiliary adjoint systems of equations 65 
governing an adjoint function and an auxiliary adjoint 
function, respectively, of said neural network; 


setting said adjoint function to zero at the beginning of 
said learning period and integrating said adjoint system 
of equations forward in time over said learning period 
to produce a first term of an indirect effect of a 
sensitivity gradient of said neural network; 

setting said auxiliary adjoint function to zero at the end of 
said learning period and integrating said auxiliary 
adjoint system of equations forward in time over said 
learning period to produce a remaining term of said 
indirect effect; 

computing a sum of said first and remaining terms, and 
multiplying said sum by a learning rate; and 

subtracting the product thereof from a current neuron 
parameter vector to produce an updated neuron param- 
eter vector. 

2. The method of claim 1 wherein said adjoint system of 
equations governs an adjoint function of said network rela- 
tive to a time dependent adjoint matrix of said network and 
a source function of said network over said learning period, 
and wherein said integrating said adjoint system of equa- 
tions produces an integral over said learning period of the 
adjoint function’s transpose multiplied by a derivative of 
said source function with respect to neuron parameters of 
said network. 

3. The method of claim 2 wherein said integrating said 
auxiliary adjoint system of equations comprises: 

computing a propagation kernel for each time step of said 
learning period from the product of said adjoint matrix 
of the current time step and an integrating time step 
size; 

multiplying each kernel by the source function evaluated 
at the current time step and accumulating a sum of 
products therefrom; and 

multiplying the sum of products by a function of (a) the 
product of all kernels of previous time steps and (b) the 
adjoint function evaluated at a final time step to pro- 
duce a final product. 

4. The method of claim 3 where said computing said 
multiplying to produce an updated neuron parameter matrix 
comprises computing a sum of said final product and said 
integral, multiplying said sum by a learning rate and sub- 
tracting the result thereof from a current neuron parameter 
vector to produce an updated neuron parameter vector. 

5. The method of claim 1 said neural network comprises 
an output layer of output neurons coupled to other neurons 
of said network, said method further comprising: 

setting outputs of said output neurons to values derived 
from a desired output training vector to perform teacher 
forcing. 

6. A method of training a neural network in accordance 
with an adjoint system of equations governing an adjoint 
function of said network relative to a time dependent adjoint 
matrix of said network and a source function of said network 
over a finite learning period, said method comprising: 

integrating said adjoint system of equations forward in 
time over said learning period to produce said adjoint 
function evaluated at the end of said learning period 
and to produce an integral over said learning period of 
the adjoint function’s transpose multiplied by a deriva- 
tive of said source function with respect to neuron 
parameters of said network; 

computing a propagation kernel for each time step of said 
learning period from the product of said adjoint matrix 
of the current time step and an integration time step 
size; 

multiplying each kernel by the source function evaluated 
at the current time step and accumulating a sum of 
products therefrom; 
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multiplying the sum of products by a function of (a) the 
product of all kernels of previous time steps and (b) the 
adjoint function evaluated at a final time step to pro- 
duce a final product; 

computing a sum of said final product and said integral, 
and multiplying said sum by a learning rate; and 
subtracting the result thereof from a current neuron 
parameter vector to produce an updated neuron param- 
eter vector. 

7. The method of claim 6 said neural network comprises 
an output layer of output neurons coupled to other neurons 
of said network, said method further comprising: 

setting outputs of said output neurons to values derived 
from a desired output training vector at the beginning 
of said learning period to perform teacher forcing. 

8. Apparatus for training a neural network so that a neuron 
output state vector thereof obeys a set of forward sensitivity 
equations over a finite learning period, said apparatus com- 
prising: 

means for setting said neuron output state vector to zero 
at the beginning of said learning period; 
means for defining first and auxiliary adjoint systems of 
equations governing an adjoint function and an auxil- 
iary adjoint function, respectively, of said neural net- 
work; 

means for setting said adjoint function to zero at the 
beginning of said learning period and integrating said 
adjoint system of equations forward in time over said 
learning period to produce a first term of an indirect 
effect of a sensitivity gradient of said neural network; 
means for setting said auxiliary adjoint function to zero at 
the end of said learning period and integrating said 
auxiliary adjoint system of equations forward in time 
over said learning period to produce a remaining term 
of said indirect effect; 

means for computing a sum of said first and remaining 
terms, and multiplying said sum by a learning rate; and 
subtracting the product thereof from a current neuron 
parameter vector to produce an updated neuron param- 
eter vector. 

9. The apparatus of claim 8 wherein said adjoint system 
of equations governs an adjoint function of said network 
relative to a time dependent adjoint matrix of said network 
and a source function of said network over said learning 
period, and wherein said means for integrating said adjoint 
system of equations produces an integral over said learning 
period of the adjoint function’s transpose multiplied by a 
derivative of said source function with respect to neuron 
parameters of said network. 

10. The apparatus of claim 9 wherein said means for 
integrating said auxiliary adjoint system of equations com- 
prises: 

means for computing a propagation kernel for each time 
step of said learning period from the product of said 
adjoint matrix of the current time step and an integra- 
tion time step size; 

means for multiplying each kernel by the source function 
evaluated at the current time step and accumulating a 
sum of products therefrom; and 
means for multiplying the sum of products by a function 
of (a) the product of all kernels of previous time steps 
and (b) the adjoint function evaluated at a final time 
step to produce a final product. 

11. The apparatus of claim 10 wherein said means for 
computing and multiplying to produce an updated neuron 
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parameter matrix comprises means for computing a sum of 
said final product and said integral, means for multiplying 
said sum by a learning rate and means for subtracting the 
result thereof from a current neuron parameter vector to 
5 produce an updated neuron parameter vector. 

12. Apparatus for training a neural network in accordance 
with an adjoint system of equations governing an adjoint 
function of said network, a time dependent adjoint matrix of 
said network and a source function of said network over a 
finite learning period, said apparatus comprising: 

means for integrating said adjoint system of equations 
forward in time over said learning period to produce 
said adjoint function evaluated at the end of said 
learning period and to produce an integral over said 
learning period of the adjoint function’s transpose 
15 multiplied by a derivative of said source function with 
respect to neuron parameters of said network; 

means for computing a propagation kernel for each time 
step of said learning period from the product of said 
adjoint matrix of the current time step and an integra- 
20 tion time step size; 

means for multiplying each kernel by the source function 
evaluated at the current time step and accumulating a 
sum of products therefrom; 

means for multiplying the sum of products by a function 
25 of (a) the product of all kernels of previous time steps 
and (b) the adjoint function evaluated at a final time 
step to produce a final product; 

means for computing a sum of said final product and said 
integral, and multiplying said sum by a learning rate; 
30 and 

subtracting the result thereof from a current neuron 
parameter vector to produce an updated neuron param- 
eter vector. 

13. A method of training a neural network in successive 
35 learning periods to respond to an input vector which is 

variable in time rapidly relative to said successive learning 
periods, in which a neuron parameter vector of said network 
is modified at the end of each learning period by combining 
it with an indirect effect function comprising a gradient of an 
40 error functional with respect to said neuron parameter 
vector, said error functional comprising a difference between 
a neuron output state vector and a target output state vector 
which is variable in time rapidly relative to said successive 
training periods, said network being characterized by a 
45 matrix of coupling coefficients defining coupling between 
respective neurons of said network, said method comprising 
the following steps carried out during each one of said 
learning periods: 

defining first and second equations governing, 
50 respectively, first and second variables, terms of each of 
said equations comprising a product of the respective 
variable and a coupling term comprising said matrix of 
coupling coefficients, a derivative of the respective 
variable with respect to time and a source term which, 
55 in said first equation, comprises a gradient of said error 
functional with respect to said neuron output state 
vector; 

setting said first and second variables to zero at the 
beginning and end, respectively, of said learning 
60 period; 

integrating said first equation forward in time from the 
beginning to the end of said one learning period to 
produce a first term of said indirect effect function and 
integrating said second equation forward in time from 
65 the beginning to the end of said one learning period to 
produce a second term of said indirect effect function; 
and 
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computing a sum of said first and second terms, and 
multiplying said sum by a learning rate; and 

combining the product thereof with a current version of 
said neuron parameter vector to produce a revised 
version of said neuron parameter vector at the end of 5 
said one learning period. 

14. The method of claim 13 wherein said combining step 
comprises subtracting said product from said current version 
of said neuron parameter vector. 

15. The method of claim 13 wherein in said second 10 
equation said source term is equal to said variable of said 
first equation at the end of said learning period. 

16. The method of claim 13 wherein the step of integrat- 
ing said second equation forward in time comprises: 

dividing said learning period into plural time steps and at 15 
each time step: 

(A) computing a propagation kernel of the current time 

step from a product of said coupling term of the 
current time step and a time interval of said plural 
time steps; 20 

(B) multiplying the kernel of the current time step by 
the source term of said second equation of the 
current time step and accumulating a sum of prod- 
ucts therefrom; and 

(C) multiplying said sum of products by a function of 
(a) the products of the kernels of the preceding time 
steps and (b) the variable of said first equation. 

17. The method of claim 16 wherein in step (C) (b), said 
variable of said first equation has been evaluated at the final 
one of said time steps. 

18. A method of training a neural network in successive 
learning periods to respond to an input vector which is 
variable in time rapidly relative to said successive learning 
periods, in which a neuron parameter vector of said network 

is modified at the end of each learning period by combining 35 
it with an indirect effect function comprising a gradient of an 
error functional with respect to said neuron parameter 
vector, said error functional comprising a difference between 
a neuron output state vector and a target output state vector 
which is variable i time rapidly relative to said successive 40 
training periods, said network being characterized by a 
matrix of coupling coefficients defining coupling between 
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respective neurons of said network, said method comprising 
the following steps carried out during each one of said 
learning periods: 

defining an equation governing a variable, terms of said of 
said equation comprising a product of the variable and 
a coupling term comprising said matrix of coupling 
coefficients, a derivative of the variable with respect to 
time and a source term which comprises a gradient of 
said error functional with respect to said neuron output 
state vector; 

setting said first variable to zero at the beginning of said 
learning period; 

integrating said equation forward in time from the begin- 
ning to the end of said one learning period to produce 
a first term of said indirect effect function; 
dividing said learning period into plural time steps and at 
each time step: 

(A) computing a propagation kernel of the current time 
step from a product of said coupling term of the 
current time step and a time interval of said plural 
time steps; 

(B) multiplying the kernel of the current time step by an 
other source term of the current time step and 
accumulating a sum of products therefrom; 

(C) multiplying said sum of products by a function of 
(a) the products of the kernels of the preceding time 
steps and (b) said variable; and 

computing a sum of said first and second terms, and 
multiplying said sum by a learning rate; and 
combining the product thereof with a current version of 
said neuron parameter vector to produce a revised 
version of said neuron parameter vector at the end of 
said one learning period. 

19. The method of claim 18 wherein the step (C) (b), said 
variable has been evaluated at the final one of said time 
steps. 

20. The method of claim 18 wherein said combining step 
comprises substracting said product from said current ver- 
sion of said neuron parameter vector. 



